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Aims: To provide a significantly improved probability dis- 
tribution for the H-test for periodicity in X-ray and y-ray ar- 
rival times, which is already extensively used by the y-ray pul- 
sar community. Also, to obtain an analytical probability dis- 
tribution for stacked test statistics in the case of a search for 
pulsed emission from an ensemble of pulsars where the signifi- 
cance per pulsar is relatively low, making individual detections 
insignificant on their own. This information is timely given 
the recent rapid discovery of new pulsars with the Fermi-LAT 
t y-ray telescope. Methods: Approximately 10'"* realisations 
of the H-statistic (H) for random (white) noise is calculated 
from a random number generator for which the repitition cy- 
cle is » 10'"*. From these numbers the probability distribution 
P{> H) is calculated. Results: The distribution of // is is found 
to be exponential with parameter A - 0.4 so that the cumula- 
tive probability distribution P{> H) - exp(-/l//). If we stack 
independent values for //, the sum of K such values would fol- 
low the Erlang-K distribution with parameter A for which the 
^li cumulative probability distribution is also a simple analytical 
expression. Conclusion: 

Searches for weak pulsars with unknown pulse profile 
shapes in the Fermi-LAT, Agile or other X-ray data bases 
(-Ih should benefit from the H-test since it is known to be pow- 
(~| erful against a broad range of pulse profiles, which introduces 
CIh only a single statistical trial if only the H-test is used. The new 
X, probability distribution presented here favours the detection of 
$_i weaker pulsars in terms of an improved sensitivity relative to 
c/2 the previously known distribution. 
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1. Introduction 

When searching for a periodic signal in X-ray or y-ray ar- 
rival times dominated by noise, we may either perform a blind 
search for y-ray pulsars as demonstrated by the Fermi-LAT 
Collaboration (Abdo et al. I2009I I. or, search for such a signal 
where the frequency parameters have been prescribed by con- 
temporary radio data ( Weltevrede 120 1 Oi l . Following the folding 
of say A^ arrival times fi, ..., tui modulo the pulsar spin parame- 
ters, we arrive at a set of phases 6*,, / - 1, ...A^. However, in the 
case of blind searches, Atwood et al. 120061 introduced a time 
differencing technique in which case the number of trial peri- 
ods is significantly reduced. 

de Jager, Swanepoel & Raubenheimer ( I1989I hereafter 
DSR) reviewed the general class of Beran statistics (Beran 
119691 1, from which the most general test statistics such as 
Pearson's ^^, Rayleigh and Z^ statistics are derived, and from 
within this class they derived the well known H-test for X-ray 
and y-ray Astronomy. 

The probability distribution of the H-test statistic as given 
by DSR was derived from Monte Carlo simulations employ- 
ing ~ 10*^ simulations. The computational power and random 
number simulators on typical IBM machines during the 1980's 
had limited ranges of applicability and the H-test suffered ac- 
cordingly. For values of // < 23 we found that the probability 
distribution was exponential with parameter A - 0.398 (or 0.4), 
whereas a hard tail developed for H > 23, which resulted in a 
significant compromise in sensitivity. 

The old version of the H-test probability distribution is al- 
ready extensively used by e.g. the Fermi-LAT Collaboration 
for pulsar searches (e.g. Abdo et al. 120091 and Weltevrede et al. 
1201 Oi l, and from this paper it will become clear that the signif- 
icances assigned to pulsar detections (or non-detections) may 



be too conservative, so that some pulsars may be missed, espe- 
cially when many trial periods are involved, so that large values 
of the //-statistic are required for a significant detection . In this 
Letter we notify the community that all previous published sig- 
nificances from the H-test should be reassessed, based on the 
new improved distribution presented below. Before we do so, 
we first briefly review the origin and properties of the H-test. 



2. The Beran class of test statistics - towards the 
H-test 

Let be the pulsar phase measured on the interval [0, In), so 
that a full rotation corresponds to 27r. Assuming noise (e.g. 
from cosmic rays) are also present such that the pulsed fraction 
is /; < 1 . Let fs(0) be the observed line-of-sight pulse profile 
in the absence of noise. The case p - Q then corresponds to 
no signal (pure noise), whereas p - \ corresponds to no noise 
(pure pulsed signal). The observed light curve f{6) can then be 
represented as a mixing of the noise and signal distributions 
(seeEqn(2)ofDSR lT989] l 
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The general form of the Beran statistic is given by Beran (I1969I I 
in the form (see also DSR 1989) 
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It is thus clear that the Beran statistic measures the integrated 
squared distance between the pulse profile and uniformity, so 
that if p — » 0, then i/r — > as well, or, if /j = l/27r (i.e. a 
flat uniform distribution), then i/^ = as well. Thus, we would 
reject uniformity if i/r exceeds a chosen positive critical value. 
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It was noted by DSR that by replacing f{6) with a density 
estimator f{d) based on the observed folded phases 0,, we re- 
trieve test statistics specific to the kernel of the density estima- 
tor Selecting the Fourier series estimator /,„ with m harmonics 
(see Eqn (5) of DSR) as representative of the light curve, results 
in the well-known Z,^, given the proper normalisation (Eqn (7) 
of DSR) 



IjTNMn). 



(3) 



Since t//{f) scales with p^ (Eqn|2]), it is then clear that the statis- 
tic Z,^, should scale as p^N. The quantity X - p y/N is then also 
the approximate Gaussian significance of the point source on 
the skymap if the background level is well known. 

The main problem raised by DSR is that we do not know 
a-priori the optimal number of harmonics to select. In the case 
of the popular Zf^ test, the optimal number of harmonics would 
depend on X and the pulse profile shape. 

The rationale behind the H-test was to find a consistent es- 
timator for f{6) where the number of harmonics m is not cho- 
sen subjectively by the observer, since selecting a number of m 
values for Z,^,, until a pleasing result is obtained, may lead to a 
false detection, given the difficulty to keep track of the number 
of trials involved. 

Hart ( 119851) derived a technique to obtain the optimal num- 
ber of harmonics M such that the mean integrated squared error 
(MISE) between the Fourier series estimator fuifi) and the true 
unknown light curve shape is a minimum (see DSR for a re- 
view). Thus, /m(0) would give the best Fourier series represen- 
tation of the true pulse profile, given the constraints imposed 
by the statistics and inherent pulsed fraction. Since the mini- 
mum of the MISE involves the finding of a maximum quantity 
over all harmonics ;«, DSR redefined this maximal (optimal) 
quantity in terms of the so-called H-test statistic (named after 
Hart): 



H = max {Zi 

l<m<2() 



Am -H 4) = Zf. - 4M -H 4 > 0. 



(4) 



Uniformity would correspond to low values of M and also 
low values of H, whereas large values of Z^ (relatively strong 
pulsed signal) corresponds to large values of H. Subtracting 
AM and adding 4 (to ensure positive values of H only) effec- 
tively limits the variance of H in the absence of a pulsed signal. 
It is clear that the H test represents a rescaled version of the Z^, 
test: if M - 1, then we retrieve the well-known Rayleigh test. 
DSR have shown that this test is powerful against a wide range 
of alternatives, which makes this test attractive if the pulse pro- 
file shape is not known a priori. 

To give an example of how H and M depend on the above- 
mentioned significance X — p ^In given real data, we selected 
archival photon arrival times from the public Fermi-LAT data 
base from the Vela pulsar direction, with events selected ac- 
cording to the Fermi-LAT point spread function, but with en- 
ergies > 500 MeV. Folding these phases with the given con- 
temporary spin parameters of Vela reveals a pulsed fraction of 
p ~ 0.96, i.e. nearly 100% and from A^ = 600 events (4 days 
integration) we akeady obtain M = 20. By adding randomly 
generated events (i.e. uniformly distributed pulse phases) to the 
signal events we effectively reduce p and hence X, so that H 
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Fig. 1. The dependence of H and M on the approximate skymap 
significance X for y-ray pulse phases above 500 MeV from the 
Vela pulsar as described in the text. The errors represent stan- 
dard deviations based on 20 independent Fermi-LAT Vela pul- 
sar data sets, each of length up to ~ 4 days. The horisontal line 
represent the upper limit of M = 20, whereas the dashed line 
represent a fit of the form H —\ .9X^ for X > 3. 



and M should also decrease accordingly. Figure 1 shows how 
H and M relates to X: For X > 3 (stronger than a ~ 3cr signal) 
we see that H scales with X^, with H - L9X^, whereas the 
optimal M is already > 10 for X > 1. 

This figure is also quite useful to see what typical H and M 
values we may expect for a typical Vela-like pulsar if we know 
the strength of the signal as derived from a point source on the 
skymap, and assuming the excess is due to such a pulsed signal. 

3. The revised probability distribution of H for 
uniformity 

The calculations were performed on the Institutional Cluster 
of the North-West University, Potchefstroom campus. To par- 
allelize the computations, we used the RngStream pack- 
age (L'Ecuyer 12002) . which guarantees independent, non- 
overlapping substreams of random numbers. The repitition cy- 
cle for this random number generator is 3 x 10^^, which is cer- 
tainly large enough for our purposes. A total of 4x lO'"* samples 
were calculated in this way. 

In the case of large statistics (A^ > 100) we do not need to 
simulate individual arrival times directly (see the approximate 
correction factors in Figures 1 and 2 of DSR in the case of low 
statistics - N < 100), so that we only need to simulate the Z^ 
statistics directly: Since Z,^, is the sum of m;i'2"Statistics, we can 
simulate a;t'2 statistic directly from a uniformly distributed ran- 
dom number re [0, 1) by taking the transformation -2 In r and 
adding these numbers to give Z^. This speeds up the process 
considerably. A total of 4 X lO'"* values of H were simulated 
in this way and the results are shown in Fig. 2. The distribu- 
tion is everywhere consistent with an exponential distribution 
with parameter A = 0.398 (or 0.4), except for // > 70 where 
a downturn relative to the 0.4 index is possibly seen. DSR ar- 
rived at the same precise value of /I = 0.398 for small values of 
H (i.e. less than 23) since the random number generator used 
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by DSR did not yet reach the limit of its random cycle for the 
number of simulations required to reach H ~ 23 (with a rela- 
tively small error on the coiTesponding probability) and should 
therefore reveal the same result as ours for // < 23. 

It is thus clear that the probability distribution of the H- 
statistic can be conservatively described by the simple formula 



Pro/7(>//) = exp(-0.4//). 



4. Incoherent Stacking 



(5) 



Suppose we analysed the data from K pulsars, or, K indepen- 
dent observations of the same pulsar, where the effect of e.g. 
unrecorded glitches excluded the possibility of analysing all 
the data as one single coherent set of arrival times. In this case 
we want to see if there is a net signal represented by K values 
of the //-statistic. Suppose we arrive at a set of K such values 
of //, given by //,, i - I, ..., K. 

Since we have shown (to the probability level of ~ 10 '"^) 
that the //-statistic follows that of an exponential distribution 
with parameter A = 0.4, we can stack these test statistics 
through the sum 



Ht - ^i=\Hi 



(6) 



which is known to follow the Erlang-K distribution with param- 
eter A, so that the significance (or probability for uniformity) of 
such stacking is given by the simple analytical expression (see 
e.g., Leemis & McQueston 2008 on univariate distribution re- 
lationships, which includes the Erlang distribution as the sum 
of independent exponential variables with parameter A) 



P(> Ht\K, A) ^ I.' 



i,_iSxp(-AHT)(AHTr 



n=0 ' 



(7) 



5. Conclusions 



For M = 1 we retrieve the well-known Rayleigh test, with 
the exception that the parameter for the exponential distribu- 
tion has been reduced from A - 0.5 (for the Rayleigh test) to 
A - 0.4 for the H - test. This slight loss of sensitivity is the 
effect of the number of trials taken implicitly into account as a 
result of the search through m within the H - test. 

Finally, it is clear that the coiTected distribution of the //- 
statistic follows a simple exponential with parameter A = 0.4 
and evaluation of results for // > 23 (i.e. the lO"** significance 
level) would yield more significant results compared to the 
old distribution presented by DSR. For example, for H - 50 
a probability of 4 x 10"^ is typically quoted in the literature, 
whereas the true probability for uniformity is actually 2 x 10"^ 
- already a factor 20 smaller. 

A Fermi-LAT example of the Vela pulsar (> 500 MeV) 
shows clearly values for M up to 20 for "skymap" significances 
X = p V^ > 20, whereas H scales with // oc X^ as expected 
for Beran-type tests. The scaling H = 1.9X^ can be used to 
predict //-test statistics for Vela-like pulsars above 500 MeV if 
we assume the excess on the skymap is all pulsed. 

The hard tail of the distribution beyond // > 23 presented 
by DSR probably arose from the repitition cycle of the random 




MODEL: EXP(-0.4H) 
de Jager et al. (1989) 
SIMULATIONS 
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Fig. 2. The distribution of the H-statistic derived from 4 x 10'^ 
Monte Carlo simulations with the best fit model for the cumu- 
lative probability P(> H) - exp(-0.4//) (soHd line) and the 
old version of the probability distribution given by DSR shown 
as a dashed line. 

number generator used in those days, so that the same fluctu- 
ations at large H values were repeated given the finite cycle 
length of the generator used. In this case we however used a 
generator with a cycle time much longer than 10'^, in which 
case we did not see the repitition of outliers as a result of a finite 
cycle length. Confirmation of the possible break (i.e. downturn) 
in the probability distribution at // > 70 requires extensive sim- 
ulations beyond 4 x 10'^ and is beyond the scope of this paper. 
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